
Conversation

@ilicmarkodb (Contributor) commented on Oct 22, 2025

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

This PR extends getStatsSchema to include the collated stats schema for every collation referenced by the DataSkippingPredicate (schema example). It also adds E2E tests for collated data skipping.
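To make the change easier to picture, here is a rough sketch of what a collated stats schema fragment could look like, built with Kernel's StructType API. The per-collation grouping and field names below are illustrative assumptions only; the layout the PR actually emits is the one in the linked schema example.

```java
// Illustrative only: the grouping of min/max columns under a collation key is an
// assumed layout, not the schema the PR actually produces.
import io.delta.kernel.types.StringType;
import io.delta.kernel.types.StructType;

public class CollatedStatsSchemaSketch {
  public static void main(String[] args) {
    // Collated min/max entries are kept only for string columns (here: "c1").
    StructType collatedMinMax = new StructType().add("c1", StringType.STRING);

    // Hypothetical stats schema fragment for a predicate that references
    // SPARK.UTF8_LCASE: min/max values grouped under that collation.
    StructType collatedStats =
        new StructType()
            .add(
                "SPARK.UTF8_LCASE",
                new StructType()
                    .add("minValues", collatedMinMax)
                    .add("maxValues", collatedMinMax));

    System.out.println(collatedStats);
  }
}
```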

How was this patch tested?

New tests.

Does this PR introduce any user-facing changes?

No.

Comment on lines -30 to -35
/**
* Set of {@link CollationIdentifier}s referenced by this predicate or any of its child
* expressions
*/
private final Set<CollationIdentifier> collationIdentifiers;

@ilicmarkodb (Contributor, Author):

I don't think we should have this field. We already have collationIdentifier in Predicate, so having this just increases complexity. Also, getReferencedCollations is only called once in the codebase, so we don't lose much by not persisting its value.
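A minimal sketch of computing the referenced collations on demand, as suggested above. The import paths and accessors (getCollationIdentifier(), getChildren()) are assumptions based on this discussion and the Kernel expression API, not a verbatim copy of it.

```java
// Sketch: collect collations by walking the expression tree when needed, instead of
// caching them in a field on DataSkippingPredicate. Accessor names and import paths
// are assumptions based on the PR discussion.
import java.util.HashSet;
import java.util.Set;

import io.delta.kernel.expressions.Expression;
import io.delta.kernel.expressions.Predicate;
import io.delta.kernel.types.CollationIdentifier;

public final class CollationCollector {
  /** Recursively collects every collation referenced by {@code expr} or its children. */
  public static Set<CollationIdentifier> getReferencedCollations(Expression expr) {
    Set<CollationIdentifier> result = new HashSet<>();
    if (expr instanceof Predicate) {
      // Assumed accessor: a collated Predicate exposes its (optional) collation.
      ((Predicate) expr).getCollationIdentifier().ifPresent(result::add);
    }
    for (Expression child : expr.getChildren()) {
      result.addAll(getReferencedCollations(child));
    }
    return result;
  }

  private CollationCollector() {}
}
```

Since this walk only happens once per scan, recomputing it is cheap and avoids keeping a redundant field in sync with the predicate tree.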

@allisonport-db (Collaborator) left a comment:

A few minor comments/questions on the source code. I'll review the tests in a bit.

Comment on lines +301 to +305
/**
* Given a data schema and a set of collation identifiers returns the expected schema for
* collation-aware statistics columns. This means 1) replace logical names with physical names 2)
* set nullable=true 3) only keep collated-stats eligible fields (`StringType` fields)
*/
@allisonport-db (Collaborator):

Is this what we do in delta spark for the stats read schema? Or is there an optimization where we only read the collated stats for columns referenced in the predicate (with that collation)?

Maybe this is a future optimization we could consider?

@ilicmarkodb (Contributor, Author):

Yes, we can optimize it like that. We can do this for both collated and binary stats. Let's leave this for future optimization so we don't complicate this PR further.
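For context on the Javadoc quoted at the top of this thread, here is a minimal sketch of the described transformation (physical names, nullable=true, keep only StringType fields). The physical-name lookup is abstracted behind a caller-supplied function because the real column-mapping plumbing is not shown here, and the per-collation grouping from the actual getStatsSchema change is omitted.

```java
// Sketch of the pruning the quoted Javadoc describes: keep only StringType leaves,
// rename them to physical names, and force nullable = true. The physicalNameOf
// function is a stand-in for the column-mapping lookup used by the real code.
import java.util.function.Function;

import io.delta.kernel.types.StringType;
import io.delta.kernel.types.StructField;
import io.delta.kernel.types.StructType;

public final class CollatedStatsSchemaPruner {
  public static StructType toCollatedStatsSchema(
      StructType dataSchema, Function<StructField, String> physicalNameOf) {
    StructType result = new StructType();
    for (StructField field : dataSchema.fields()) {
      if (field.getDataType() instanceof StructType) {
        // Recurse into nested structs; keep them only if they still contain
        // collated-stats eligible (string) leaves.
        StructType nested =
            toCollatedStatsSchema((StructType) field.getDataType(), physicalNameOf);
        if (!nested.fields().isEmpty()) {
          result = result.add(physicalNameOf.apply(field), nested, true /* nullable */);
        }
      } else if (field.getDataType() instanceof StringType) {
        result = result.add(physicalNameOf.apply(field), field.getDataType(), true);
      }
      // Non-string leaves are dropped: they are not eligible for collated stats.
    }
    return result;
  }

  private CollatedStatsSchemaPruner() {}
}
```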

@allisonport-db (Collaborator) left a comment:

Just a few minor comments, plus please fix the failing tests. Afterward, LGTM.

}
}

test("partition and data skipping - combined pruning on partition and data columns") {
@allisonport-db (Collaborator):

nit: add a collation description to the test title

}
}

test("data skipping - predicates with SPARK.UTF8_BINARY on data column") {
@allisonport-db (Collaborator):

Can you add a test case over a nested string column?

test("pruneStatsSchema - collated min/max columns") {
val utf8Lcase = CollationIdentifier.fromString("SPARK.UTF8_LCASE")
val unicode = CollationIdentifier.fromString("ICU.UNICODE")
val testSchema = new StructType()
@allisonport-db (Collaborator):

nit: I think you could define reusable vals to make this simpler, e.g.:

val nestedField = ...
val s1Field = ...
val allFields = ... // (s1 + i1 + i2 + nested)
